13 research outputs found

    Argo: A Time-Elastic Time-Division-Multiplexed NOC using Asynchronous Routers


    EDA tool for timing analysis, optimization and timing validation of asynchronous circuits

    Since the mid-1980s, synchronous circuits have enjoyed a constantly maturing EDA tool and flow framework, which first enabled the implementation of multi-million-transistor chips and today sustains the pace of the electronics industry. The two cornerstones of EDA are Timing Analysis and Timing Analysis-Driven Optimization. Unfortunately, conventional Static Timing Analysis cannot be applied directly to asynchronous circuits, because these always contain feedback and are therefore cyclic, closed-loop systems. The primary reason why Asynchronous Design approaches are not attempted today is the lack of a viable and complete EDA flow. This work presents a complete Asynchronous Timing Analysis algorithm implementation, suitable for EDA, which is capable of analyzing the timing of any asynchronous circuit. It also demonstrates closed-loop Timing Analysis-Driven Optimization for asynchronous circuits. The TA algorithm has its foundations in prior theoretical work on algorithms for deriving accurate bounds on the separation time between events of concurrent systems. Several insufficient and incomplete aspects of that work were clarified, completed, and improved, and a complete and efficient implementation has been achieved. Results on several asynchronous circuits demonstrate the viability of the implemented algorithm and its ability to automatically and selectively optimize the timing-critical parts of an asynchronous circuit for delay and the remaining parts for area.

    An area-efficient network interface for a TDM-based Network-on-Chip

    Network interfaces (NIs) are used in multi-core systems, where they connect processors, memories, and other IP-cores to a packet-switched Network-on-Chip (NOC). The function of an NI is to bridge between the read/write transaction interfaces used by the cores and the packet-streaming interface used by the routers and links in the NOC. The paper addresses the design of an NI for a NOC that uses time-division multiplexing (TDM). By keeping the essence of TDM in mind, we have developed a new area-efficient NI micro-architecture. The new design completely eliminates the need for FIFO buffers and credit-based flow control, resources which are reported to account for 50–85% of the area in existing NI designs. The paper discusses the design considerations, presents the new NI micro-architecture, and reports area figures for a range of implementations.

    A Min-Heap-Based Accelerator for Deterministic On-the-Fly Pruning in Neural Networks

    This paper addresses the design of an area- and energy-efficient hardware accelerator that supports on-the-fly pruning in neural networks. In a layer of N neurons, the accelerator selects the top K neurons in every timestep. As K is fixed, the runtime of the pruned network is deterministic, which is an important property in real-time systems such as hearing aids. As a first contribution, we propose to use a min-heap for the top K selection due to its efficient data structure and low time complexity. As a second contribution, we design and implement a hardware accelerator for dynamic pruning that is based on the min-heap algorithm. The heap memory storing the top K neurons and their indices is realized as a 3-port standard-cell-based memory implemented with latches. As a third contribution, we evaluate the energy savings from pruning of a gated recurrent unit used in a neural network for speech enhancement (a regression task). Our experiments demonstrate energy savings of ~78% without degrading the SNR improvement, and up to ~93% while reducing the SNR improvement by 0.1–1.11 dB. Moreover, the overhead of the hardware accelerator constitutes a negligible ~0.5% of the total energy. The accelerator is implemented in a 22nm CMOS process.
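The top-K selection with a min-heap mentioned in this abstract can be illustrated in software. The following is a minimal sketch only, not the accelerator's actual design: the function name `top_k_indices` is hypothetical, and ranking neurons by their raw activation value is an assumption, since the abstract does not state the ranking criterion.

```python
import heapq


def top_k_indices(activations, k):
    """Return the indices of the k largest activations (hypothetical helper).

    A min-heap of at most k (value, index) pairs is maintained. The root is
    the smallest value currently kept, so each new value needs only one
    comparison against the root to decide whether it enters the top k.
    """
    if k <= 0:
        return []
    heap = []  # heap[0] is always the smallest kept (value, index) pair
    for idx, value in enumerate(activations):
        if len(heap) < k:
            heapq.heappush(heap, (value, idx))
        elif value > heap[0][0]:
            # Evict the current minimum and insert the new candidate.
            heapq.heapreplace(heap, (value, idx))
    return sorted(idx for _, idx in heap)


if __name__ == "__main__":
    print(top_k_indices([0.1, 0.9, 0.3, 0.7, 0.05], 2))
```

Each of the N values costs at most one heap update of O(log K), and because K is fixed the amount of work per timestep is bounded, which matches the determinism argument made in the abstract.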

    A Statically Scheduled Time-Division-Multiplexed Network-on-Chip for Real-Time Systems

    This paper explores the design of a circuit-switched network-on-chip (NoC) based on time-division multiplexing (TDM) for use in hard real-time systems. Previous work has primarily considered application-specific systems. The work presented here targets general-purpose hardware platforms. We consider a system with IP-cores, where the TDM-NoC must provide directed virtual circuits, all with the same bandwidth, between all nodes. This may not be a frequent scenario, but a general platform should provide this capability, and it is an interesting point in the design space to study. The paper presents an FPGA-friendly hardware design, which is simple, fast, and consumes minimal resources. Furthermore, an algorithm to find minimum-period schedules for all-to-all virtual circuits on top of typical physical NoC topologies like 2D-mesh, torus, bidirectional torus, tree, and fat-tree is presented. The static schedule makes the NoC time-predictable and enables worst-case execution time analysis of communicating real-time tasks. Keywords: real-time systems; network-on-chip
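The idea of a static, conflict-free TDM schedule can be sketched as a simple check. This is an illustration under stated assumptions, not the paper's scheduling algorithm: the function `find_conflicts`, the link names, and the model in which a flit advances exactly one link per slot are all hypothetical.

```python
def find_conflicts(circuits, period):
    """List (link, slot, first_circuit, second_circuit) clashes in a TDM schedule.

    circuits maps a circuit name to (start_slot, [link, ...]).
    Assumed model: a flit injected in start_slot occupies the i-th link of
    its path in slot (start_slot + i) mod period.
    """
    used = {}       # (link, slot) -> name of the circuit that owns it
    conflicts = []
    for name, (start, path) in circuits.items():
        for hop, link in enumerate(path):
            key = (link, (start + hop) % period)
            if key in used:
                conflicts.append((link, key[1], used[key], name))
            else:
                used[key] = name
    return conflicts


if __name__ == "__main__":
    schedule = {
        "A->C": (0, ["AB", "BC"]),  # uses AB in slot 0 and BC in slot 1
        "B->C": (0, ["BC"]),        # uses BC in slot 0
    }
    print(find_conflicts(schedule, period=2))
```

A schedule for which this check returns an empty list needs no run-time arbitration, because each link carries at most one circuit in any slot. That property is what makes the communication timing analyzable offline.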

    Router Designs for an Asynchronous Time-Division-Multiplexed Network-on-Chip

    In this paper we explore the design of an asynchronous router for a time-division-multiplexed (TDM) network-on-chip (NOC) that is being developed for a multi-processor platform for hard real-time systems. TDM inherently requires a common time reference, and existing TDM-based NOC designs are either synchronous or mesochronous. Both approaches have their limitations: a globally synchronous NOC is no longer feasible in today's submicron technologies, and a mesochronous NOC requires special FIFO-based synchronizers in all input ports of all routers in order to accommodate clock phase differences. This adds hardware complexity and increases area and power consumption. We propose to use asynchronous routers in order to achieve a simpler, more robust, and globally asynchronous NOC, which represents an unexplored point in the design space. The paper presents a range of alternative router designs. All routers have been synthesized for a 65nm CMOS technology, and the paper reports post-layout figures for area, speed, and energy and compares the asynchronous designs with an existing mesochronous clocked router. The results show that an asynchronous router is 2 times smaller, marginally slower, and has roughly the same energy consumption, while offering a robust solution to the clock distribution problem. The paper further explores clock-gating of the individual pipeline stages in the asynchronous routers, and shows that this can lead to significant power savings. (c) 2013 IEEE

    A Neural Network Engine for Resource Constrained Embedded Systems


    Argo: A Real-Time Network-on-Chip Architecture With an Efficient GALS Implementation

    In this paper, we present an area-efficient, globally asynchronous, locally synchronous network-on-chip (NoC) architecture for a hard real-time multiprocessor platform. The NoC implements message-passing communication between processor cores. It uses statically scheduled time-division multiplexing (TDM) to control the communication over a structure of routers, links, and network interfaces (NIs) to offer real-time guarantees. The area-efficient design is a result of two contributions: 1) asynchronous routers combined with TDM scheduling and 2) a novel NI microarchitecture. Together they result in a design in which data are transferred in a pipelined fashion, from the local memory of the sending core to the local memory of the receiving core, without any dynamic arbitration, buffering, or clock synchronization. The routers use two-phase bundled-data handshake latches based on the Mousetrap latch controller and are extended with a clock gating mechanism to reduce the energy consumption. The NIs integrate the direct memory access functionality and the TDM schedule, and use dual-ported local memories to avoid buffering, flow control, and synchronization. To verify the design, we have implemented a 4 × 4 bitorus NoC in 65-nm CMOS technology and we present results on area, speed, and energy consumption for the router, NI, NoC, and postlayout